Documentation Index
Fetch the complete documentation index at: https://mintlify.com/ikawrakow/ik_llama.cpp/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
On Debian/Ubuntu, install a C/C++ toolchain and CMake.
CMake flags
Pass flags to the initial cmake -B build invocation.
| Flag | Default | Description |
|---|---|---|
| GGML_NATIVE | OFF | Optimize for the host CPU (-march=native). Turn off when cross-compiling. |
| GGML_CUDA | OFF | Build with CUDA support. Requires the NVIDIA CUDA Toolkit. Defaults to native CUDA architecture detection. |
| CMAKE_CUDA_ARCHITECTURES | auto | Target a specific GPU compute capability, e.g. 86 for RTX 30-series. |
| GGML_RPC | OFF | Build the RPC backend for distributed inference across machines. |
| GGML_IQK_FA_ALL_QUANTS | OFF | Enable all KV cache quantization types for Flash Attention (beyond the default f16, q8_0, q6_0, and bf16). |
| GGML_NCCL | ON | Enable NCCL for multi-GPU communication. Set to OFF to disable. |
| LLAMA_SERVER_SQLITE3 | OFF | Build SQLite3 support into llama-server (required for the mikupad web UI). |
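As an illustrative combination of the flags above, a CUDA build targeting an RTX 30-series GPU with the mikupad web UI enabled might look like this (the exact flag selection is an example, not a recommended default):

```shell
# Configure: enable CUDA, target compute capability 8.6 (RTX 30-series),
# and build SQLite3 support into llama-server for the mikupad web UI.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DLLAMA_SERVER_SQLITE3=ON

# Compile in Release mode using all available cores.
cmake --build build --config Release -j
```

Omit -DCMAKE_CUDA_ARCHITECTURES to fall back to native architecture detection.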
Environment variables
Set these in the shell before invoking llama-server or any other tool.
| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | Restrict which GPUs are visible. Example: CUDA_VISIBLE_DEVICES=0,2 uses the first and third GPU only. |
| GGML_CUDA_ENABLE_UNIFIED_MEMORY | Set to 1 to enable CUDA Unified Memory, allowing the GPU to access host RAM when VRAM is exhausted. Useful for large models on systems with limited VRAM. |
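Both variables can be combined on a single invocation. The binary path and model filename below are illustrative placeholders:

```shell
# Expose only the first and third GPU, and allow spillover into host RAM
# via CUDA Unified Memory when VRAM is exhausted.
CUDA_VISIBLE_DEVICES=0,2 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ./build/bin/llama-server -m model.gguf
```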
The only fully supported compute backends are CPU (AVX2 or better, ARM NEON
or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively
maintained.